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Abstract Variable selection in cluster analysis is important yet challenging. It can be achieved by regu¬ 
larization methods, which realize a trade-off between the clustering accuracy and the number of selected 
variables by using a lasso-type penalty. However, the calibration of the penalty term can suffer from 
criticisms. Model selection methods are an efficient alternative, yet they require a difficult optimization 
of an information criterion which involves combinatorial problems. First, most of these optimization al¬ 
gorithms are based on a suboptimal procedure ( e.g . stepwise method). Second, the algorithms are often 
greedy because they need multiple calls of EM algorithms. Here we propose to use a new information 
criterion based on the integrated complete-data likelihood. It does not require the maximum likelihood 
estimate and its maximization appears to be simple and computationally efficient. The original contri¬ 
bution of our approach is to perform the model selection without requiring any parameter estimation. 
Then, parameter inference is needed only for the unique selected model. This approach is used for the 
variable selection of a Gaussian mixture model with conditional independence assumption. The numerical 
experiments on simulated and benchmark datasets show that the proposed method often outperforms 
two classical approaches for variable selection. The proposed approach is implemented in the R package 
VarSclLCM available on CRAN. 

Keywords Gaussian mixture model • Information criterion ■ Integrated complete-data likelihood • 
Model-based clustering • Variable selection 


1 Introduction 

Clustering allows us to summarize large datasets by grouping individuals into few characteristic classes. It 
aims to discover an a priori unknown partition among the individuals. In many cases, this partition may 
be best explained by only a subset of the observed variables. So, by performing the variable selection in 
the cluster analysis, both model fitting and result interpretation are facilitated. Indeed, for a fixed sample 
size, a variable selection method can provide a more accurate identification of the classes. Moreover, such 
methods bring out the variables characterizing the classes . 

Regularization methods can be used to achieve variable selection in clustering. One can cite the 
approaches of Friedman, J.H. and Meulman, J.J. (2004) or Pan, W. and Shen, X. (2007). Recently, these 
methods have been outperformed by the sparse K-means proposed by Witten and Tibshirani (2010). It 
uses a lasso-type penalty to select the set of variables relevant to clustering. Since it requires small 
computational times, it can manage high-dimensional datasets. Moreover, the selection of the number 
of classes is a difficult issue since probabilistic tools are not available. Finally, its results are sensitive to 
structure of the penalty term. 
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Model selection approaches can be used to carry out the variable selection in a probabilistic frame¬ 
work. Tadesse, M.G. and Sha, N. and Vannucci, M. (2005) consider two types of variables: the set of the 
relevant variables and the set of the irrelevant variables which are independent of the relevant ones. This 
method has been extended by Raftery and Dean (2006) by using a greedy search algorithm to find the 
set of relevant variables. Obviously, this algorithm finds only a local optimum in the space of models. It 
is feasible for quite large datasets because of its moderate computing time. Still, this method remains 
time consuming since the model comparisons are performed by using the BIC criterion (Schwarz, 1978). 
Therefore, the maximum likelihood estimate must be computed for each competing model. These es¬ 
timates are mainly instrumental since the practitioner interprets only the estimate related to the best 
model. 

In this paper, we propose a new information criterion, named MICL criterion (Maximum Integrated 
Complete-data Likelihood), for carrying out the variable selection in model-based clustering. This cri¬ 
terion is quite similar to the ICL criterion (Biernacki, C. and Celeux, G. and Govaert, G., 2010), and it 
inherits its main properties. However, these two criteria evaluate the integrated complete-data likelihood 
at two different partitions. The MICL criterion uses the partition maximizing this function, while the 
ICL criterion uses the partition provided by a MAP rule associated to the maximum likelihood estimate. 

In this article, we focus on variable selection for a Gaussian mixture model with conditional inde¬ 
pendence assumption, but the method can be extended to more general mixture models. Note that this 
model is useful especially when the number of variables is large (Hand and Kerning, 2001). Moreover, 
the conditional independence is often an hidden assumption made by distance-based methods like the 
K-means algorithm (Govaert, 2009). The MICL criterion takes advantage of the closed form of the inte¬ 
grated complete-data likelihood when the priors are conjugated. The model selection is carried out by a 
simple and fast procedure which alternates two maximizations for providing the model maximizing the 
MICL criterion. The convergence properties of this algorithm are similar to the convergence properties 
of the EM algorithm. In particular, it converges to a local optimum of the function to maximize. So, 
multiple random initializations are required to ensure its convergence to the global maximum. 

The proposed method and the methods of Witten and Tibshirani (2010), and of Raftery and Dean 
(2006) are compared on simulated and on challenging real datasets. We show that the proposed method 
outperforms both other methods in terms of model selection and partitioning accuracy. It often provides a 
model with a better value of the BIC criterion than the algorithm of Raftery and Dean (2006), although 
it does not directly optimize this criterion. Finally, we show that the proposed method can manage 
datasets with a large number of variables and a moderately large number of individuals. Note that it is 
the most common situation which requires variable selection in cluster analysis. 

The paper is organized as follows. Section 2 briefly reviews the framework of variable selection for the 
Gaussian mixture model. A presentation of the integrated complete-data likelihood is done in Section 3 
before introducing the MICL criterion. Section 4 is devoted to the inference based on the MICL criterion. 
Section 5 illustrates the robustness properties of the MICL criterion and compares the three methods 
of variable selection on simulated data. Section 6 compares the three methods of variable selection on 
challenging datasets. The advantages and limitations of the method are discussed in Section 7. 


2 Variable selection for Gaussian mixture model 

2.1 Mixture model of Gaussian distributions 

Data to analyze are n observations x = (x \,..., x n ), where object aq = (xn ,..., Xid) is described by d 
continuous variables defined on WL d . Observations are assumed to arise independently from a Gaussian 
mixture model with g components, assuming conditional independence between variables. Therefore, the 
model density is written as 

g d 

f(xi\m,9) = (1) 

k= 1 j=1 

where m specifies the model, 9 = (fx,c t,tv) is the whole parameter vector, 7r = (ni,..., Tr g ) is the 
vector of mixing proportion defined on the simplex of size g, ft = (pikj',k = 1 = 1 ,...,d), 

er = (ofcjj/c = 1 = l,...,d), and where (f>(.\fj,kj , crjh) is the density of a univariate Gaussian 

distribution with mean pkj and variance crjv. 
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A variable is said to be irrelevant to the clustering if its one-dimensional marginal distributions are 
equal between components. Thus, by introducing ojj such that ujj = 0 if variable j is irrelevant and 
u>j = 1 if the variable is relevant for the clustering, the following equalities hold: 

Vj G {/ : Uj> = 0}, nij = ... = Hgj and cry = ... = a gj . (2) 

Thus, a model m = (g,u>) is defined by a number of components g and the binary vector u) = (ujj ; j = 
1,,d) which encodes whether each of d possible variables are relevant to the clustering. 


2.2 Model selection based on the integrated likelihood 


Model selection generally aims to find the model rh which obtains the highest posterior probability 
among a collection of competing models M. So, 

rn = argrnax me7W p(m|x). (3) 

This model selection approach is consistent since rh converges in probability to the true model m as 
long as the true model belongs to the model space ( i.e. if m G Ad). 

By assuming uniformity for the prior distribution of m, rh maximizes the integrated likelihood defined 

by 

rh = argmax mgA1 p(x|m) with p(x|m) 

= f p(x\m,0)p(0\m)d6 , (4) 

J0 m 

where & m is the parameter space of model m, p(x\m,0) = Il/Li /(®*|wi, 0 ) is the likelihood function 
and p(0\m) is the prior distribution of the parameters. We assume independence between the prior, so 


d 

p(0\m) =p(ir\m)Y[p(<rij,IJ’. j \m), 
3=1 

where cr'A = (crjb; k = 1,..., g) and /xA = {p 2 kj ; k = 1,..., g), and 

k =1 


( 5 ) 


We use conjugate prior distributions, thus n\m follows a Dirichlct distribution T> g (i,..., |) which is the 
Jeffreys non informative prior (Robert, 2007). Moreover, a kj \m follows an Inverse-Gamma distribution 
lG(atj/2,l3j/2) and p,kj\m,(T%j follows a Gaussian distribution Af(Xj,a k: j/dj), where (aj,0j,\j,5j) are 
hyper-parameters. 

Unfortunately, the integrated likelihood is intractable. However, many methods permit to approximate 
its value (Friel, N. and Wyse, J., 2012). The most popular approach consists in using the BIC criterion 
(Schwarz, 1978), which approximates the logarithm of the integrated likelihood by Laplace approximation 
and requires maximum likelihood estimation. The BIC criterion is written as 

BIC(m) = lnp(x|m, d m ) — Inn, (6) 

where 6 m is the maximum likelihood estimate related to model m when iy m is the number of parameters 
required by m. 

For a fixed value of g , the variable selection in clustering necessitates the comparison of 2 d models. 
Therefore, an exhaustive approach which approximates the integrated likelihood for each competing 
model is not doable. Instead, Raftery and Dean (2006) carry out the model selection by deterministic 
algorithms (like a forward method) which are suboptimal. Moreover, they are time consuming when 
the number of variables is large, because they involve many parameter estimations for their model 
comparisons. 

All maximum likelihood estimates are mainly instrumental: they are only used for computing the 
BIC criterion, with the exception of the estimates related to the selected model rh which are interpreted 
by the practitioner. Therefore, we introduce a new criterion for model selection which does not require 
parameter estimates. 
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3 Model selection based on the integrated complete-data likelihood 

3.1 The integrated complete-data likelihood 

A partition is given by the vector z = (z i, ..., z n ) where Zi = (.zu , ..., Zi g ) indicates the class label 
of vector i, i.e. Zik = 1 if aj» arises from component k and Zik = 0 otherwise. In cluster analysis, z 
is a missing value. Thus, the likelihood function computed on the complete-data (observed and latent 
variables), called complete-data likelihood function, is introduced. It is defined by 

n g d 

p(x,zim, 0 )=Yi n k n<i > ( x ij\»kj,(7kj)) zik ■ ( 7 ) 

2=1 k—1 j =1 


The integrated complete-data likelihood is 


p(x, z|m) = / p{x.,z\m,Q)p{p\m)dQ. (8) 

■' & m 

Since conjugate prior distributions are used, the integrated complete-data likelihood has the following 
closed form 

d 

p( x ,z|m) =p(z\g) JJp(x.j| 5 ,Wj,z), (9) 

j'=i 

where x #J = (x^; i = 1 ,,n). More specihcally, 


p{z\g) 


n i) nLi nn k + ^ 

r {\)9 /> + §) 


( 10 ) 


where n k = E”=i z ik, and 

P(x.j|p,Wj,z) = 


(*) 


i/2 


re., (h 


n«k/ 2 r 


' n fc+°j 1 


^7 if = 0 




j+ n k 


( 11 ) 


— j-r- if uja = 1, 

n k +Oj J ’ 


where s] = /?J + E"*i(*« “ %) 2 + (i -i+ (tl +^-i) . = £ E?=i s | fe = $ + E2=i “ x ok) 2 + 

and For J such as “j = 0, we g et P{*»j\9,Uj, z) = p(x.j|p,a;j) 

since the partition does not impact the value of the integral. 


3.2 The ICL criterion 

The ICL criterion (Biernacki, C. and Celeux, G. and Govaert, G., 2010) carries out the model selection 
by focusing on the goal of clustering. It favors a model providing a partition with a strong evidence 
since it makes a trade-off between the model evidence and the partitioning evidence. The ICL criterion 
is defined by 

ICL(m) = lnp(x,z m |m), (12) 

where z m is the partition given by the MAP rule evaluated at the maximum likelihood estimate 6 m . 

When the model at hand is not the model used in the sampling scheme, the ICL criterion inherits 
robustness from this trade-off while the BIC criterion tends to overestimate the number of components. 
This phenomenon is illustrated in Section 5.1 by our numerical experiments. 

Although the ICL criterion has a closed form, it requires the maximum likelihood estimates to define 
the partition z m . The time devoted to parameter estimation can become computationally prohibitive. 
Therefore, in this work, we introduce a new criterion avoiding this drawback. 
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3.3 The MICL criterion 


We propose a new information criterion for model selection, named MICL criterion (Maximum Integrated 
Complete-data Likelihood). This criterion corresponds to the largest value of the integrated complete- 
data likelihood among all the possible partitions. Thus, the MICL criterion is defined by 

MICL(m) = lnp(x, z* m \m) with z * m = argmax z lnp(x, z| m). (13) 

Obviously, this criterion is quite similar to the ICL criterion and inherits its main properties. In 
particular, it is less sensitive to model misspecification than the BIC criterion. Unlike the ICL and the 
BIC criteria, it does not require the maximum likelihood estimates but it implies the estimation of 
which appears to be easily accessible (see Section 5 and Section 6 ). Among the models in competition, 
the selected model maximizes the MICL criterion and is denoted by m* with 

m* = argmax rrl 6 A 1 MICL(m). (14) 

The selected model m* is consistent when the number of components is known, see the proof in 
Appendix A. Nevertheless, like the ICL criterion, the MICL criterion lacks consistency to select the 
number of components if the component overlap is too strong. However, numerical experiments show its 
good behaviour to also select the right number of components (see Section 5.1). 


4 Model selection and parameter estimation 

The number of components of the competing models is usually bounded by a value <? ma x- So, the space 
of the competing models is written as 

M = {m = (g,u) : g G {1,..., g max } and u G {0, l} d } . (15) 

We denote by M g the restriction of M to the subset of the models having g components. The model 
m* maximizes the MICL criterion among the models belonging to M g . Therefore, 

m* g = argmax mg 7 Wg MICL(m) with M g = {( 3 , cu) : cu G (0, l} d }. (16) 

Thus, m* defines the best variable selection according to the MICL criterion for a fixed value of g. 
Obviously, 

m* = argmax g=1> . iSmax MICL (m*). (17) 

The estimation of m* implies to maximize the integrated complete-data likelihood on z). This maxi¬ 
mization is facilitated by the fact that u) does not influence the definition space of the vector of component 
membership. Indeed, z^ is defined on the same space {1,... , 3 }" for each model m in M g . This twofold 
optimization is performed by an iterative algorithm presented below. Note that this algorithm cannot be 
used to optimize the ICL or the BIC criterion, since Q us not defined on the same space for each model 
in A i g . We obtain m* by running this algorithm with g chosen from one to j m ax- 


4.1 Algorithm for MICL-based model selection 

The following iterative algorithm is used to find m* for any g in {1,... , 3 max}- Starting from the initial 
point (z[°l,m[°l) with mi 0 ! G M g , it alternates between two optimizations of the integrated complete- 
data likelihood: optimization on z, given (x, m), and maximization on u> given (x, z). The algorithm is 
initialized as follows: first mJ°l is sampled from a uniform distribution on M g , second z^l = i m [o] is 
the partition provided by a MAP rule associated to model and to its maximum likelihood estimate 
6 m [ 0 ]. Iteration [r] of the algorithm is written as 
Partition step: fix z^l such that 

lnp(x,z^|m^) > lnp(x,z^ 1 " - 1 !|m^). 

Model step: fix mi r+1 l = argmax m£jMs lnp(x, z^ \m) such that 

m [r+1] = ( 3 ,w [r+1] ) with J; +1] = argmax^ g{ 01 } p(x. i | 3 ,w i ,z [rI ). 


5 


The partition step is performed by an iterative method. Each iteration consists in sampling uniformly 
an individual which is affiliated to the class maximizing the integrated complete-data likelihood while 
the other class memberships are unchanged. 

Like an EM algorithm, the proposed algorithm converges to a local optimum of lnp(x, z|m). Thus, 
many different initializations should be used to ensure the convergence to m*. However, we show that 
this algorithm does not suffer from the problem of local optima during our applications (see Section 6). 


4.2 Maximum likelihood inference for the model maximizing the MICL criterion 


When model m* = (g*, ll>*) has been found, usually the estimate 6 m * maximizing the likelihood function 
is required: 

Om* = argmax eg@ ^p(x|m*,0). (18) 

The direct optimization of the likelihood function would involve to solve equations that have no analyt¬ 
ical solution. Instead, the parameter estimation is performed via an EM algorithm (Dempster, A.P. and Laird, N.M. and R 
1977), which is often simple and efficient in the situation of missing data. This iterative algorithm alter¬ 
nates between two steps: the computation of the complete-data log-likelihood conditional expectation (e 
step) and its maximization (m step). Its iteration [r] is written as: 

E step: computation of the conditional probabilities 


t 


M 

ik 


K [ k 


>12 

T kj 




W 2 x ' 
k'j > 


M step: maximization of the complete-data log-likelihood 


r+l] 


t 


[r] 


r+l] 


n ’ ^ 


A =l u ' l J ^ = 0’ 


r]„ 

h Er =1 *Sb (*« - mL 7 1] ) 2 if u i = i 
1 £ E?=i(*« - / 4? 11 ) 2 if U 1 = °> 

where = E"=i ■ Note that the EM algorithm can provide the maximum a posteriori estimate by 
slightly modifying its M step (Green, P.J., 1990). 


[r+l]2 


5 Numerical experiments on simulated data 

Implementation of the proposed method Results of our method (indicated by MS) are provided by the 
R package VarSelLCM available on CRAN. This package performs 50 random initializations of the 
algorithm described in Section 4.1 to carry out the model selection. The following hyper-parameters are 
chosen to be fairly flat in the region where the likelihood is substantial and not much greater else-where: 
aj = 1, 8j = 1, A j = mean(x,j) and 5j = 0.01. 

Competing methods 

— The model-based clustering method of Raftery and Dean (2006) is denoted RD in what follows. It 
runs using the R package clustvarsel (Scrucca and Raftery, 2014). Results are given by the headlong 
algorithm for the forward direction (denoted RD-forw) and by the greedy algorithm in the backward 
direction (denoted RD-back). 

— The sparse K-means method of Witten and Tibshirani (2010) is not a model-based approach. It runs 
using the R package sparcl (Witten and Tibshirani, 2013) with its options by default. In what follows, 
this method is indicated by WT. 

Simulation map First, information criteria are compared on datasets sampled from the well-specified 
model and on datasets sampled from a misspecified model. Second, the three competing methods of 
variable selection are compared. The calculations are carried out on an 8 Intel Xeon 3.40GHZ CPU 
machine. 
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5.1 Comparing model selection criteria 

5.1.1 Simulated data: well-specified model 

Here we compare model selection criteria when the sampling model belongs to the set of the competing 
models. Individuals are drawn from a bi-component Gaussian mixture model with conditional indepen¬ 
dence assumption. The first two variables are relevant to the clustering and the last two variables are 
not. Thus, the true model denoted by m is 

m(°) = (g ( 0 ) ,u> (0) ) with g (0) = 2 and w (0) = ( 1 , 1 , 0 , 0 ). 

The following parameters are used: 

7Tfc = 0.5, flu = fi 12 = e, fi 21 = P 22 = —Mfc 3 = Mfc 4 = o and a kj = 1. 

The value of s defines the class overlap. Table 1 presents the results obtained for different sample sizes 
and for different class overlaps. For each case, 100 samples are generated and the criteria are computed 
for all the possible models in A4 with g max = 6. 


n 


£ 

criterion 

50 

100 

200 

400 

800 

1.26 

BIC 

95 (85) 

100 (99) 

100 (100) 

100 (100) 

100 (100) 


ICL 

85 (72) 

98 (95) 

100 (97) 

100 (97) 

100 (99) 


MICL 

86 (73) 

98 (94) 

100 (97) 

100 (98) 

100 (99) 

1.05 

BIC 

85 (77) 

99 (94) 

100 (98) 

100 (99) 

100 (100) 


ICL 

45 (42) 

69 (62) 

98 (94) 

100 (99) 

100 (100) 


MICL 

50 (45) 

69 (62) 

99 (96) 

100 (99) 

100 (100) 

0.85 

BIC 

46 (32) 

73 (69) 

98 (94) 

100 (99) 

100 (100) 


ICL 

9 (7) 

15 (13) 

12 (9) 

27 (26) 

31 (31) 


MICL 

10 (8) 

16 (15) 

13 (11) 

31 (30) 

35 (35) 


Table 1 Results of model selection for different information criteria under the true model. In plain, percentage where the 
true number of components has been selected. In parenthesis, percentage where the true model (ml 0 )) has been 

selected. 


When the class overlap is not too high, all the criteria are consistent. Indeed, they asymptotically 
always select the true model. In such a case, the BIC criterion outperforms the other criteria when the 
sample size is small. When the class overlap is equal to 0.20, the BIC criterion stays consistent while the 
other ones select only a single class. However, we now show that the BIC criterion suffers from a lack of 
robustness. 

5.1.2 Simulated data: misspecified model 

We look at robustness of the criteria based on the integrated complete-data likelihood. Again, the first 
two variables contain the relevant clustering information. They follow a bi-component mixture model of 
uniform distributions with conditional independence assumption and equal proportions. More specifically, 
they are generated independently from the uniform distribution on [e — 1 , e + 1] for the first component 
and the uniform distribution on [— e— 1, — £ +1] for the second. The remaining two variables are irrelevant 
variables that are independent of the clustering variables and follow two independent standard Gaussian 
distributions. For each case, 100 samples are generated and the criteria are computed for all the possible 
models in M with g max = 6 . Table 2 summarizes the selection results for each criterion. 

Results show that the BIC criterion is not useful to select the number of components. Indeed, it 
overestimates the number of classes to better fit the data since the sampling model does not belong to 
the set of the competing models. The other criteria show considerably better performance since they 
select the true number of classes and the true u>. It appears that they are more robust than the BIC 
criterion to the misspecification of the model at hand. 

To conclude, the ICL and the MICL criteria obtain good results for model selection when the class 
overlap is not too strong. Moreover, they are more robust to model misspecification than the BIC 
criterion. Since the MICL criterion does not require maximum likelihood inference for all of the competing 
models, it is preferable to the ICL criterion for carrying out model selection. 
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£ 

criterion 

50 

100 

200 

400 

800 

1.26 

BIC 

79 (75) 

80 (79) 

48 (48) 

0 (0) 

0 

(0) 


ICL 

100 (96) 

100 (98) 

100 (100) 

100 (99) 

97 

(97) 


MICL 

100 (95) 

100 (98) 

100 (100) 

100 (99) 

96 

(96) 

1.05 

BIC 

86 (83) 

83 (80) 

49 (48) 

4 (4) 

0 

(0) 


ICL 

100 (91) 

100 (95) 

100 (98) 

99 (98) 

99 

(98) 


MICL 

100 (92) 

100 (99) 

100 (98) 

98 (98) 

99 

(98) 

0.85 

BIC 

80 (78) 

72 (71) 

36 (36) 

0 (0) 

0 

(0) 


ICL 

97 (87) 

100 (96) 

100 (98) 

99 (97) 

97 

(97) 


MICL 

97 (92) 

100 (98) 

100 (98) 

99 (97) 

97 

(97) 


Table 2 Results of model selection for different information criteria under the non-Gaussian model. In plain, percentage 
where the true number of components (g = 2) has been selected. In parenthesis, percentage where the true number of 
classes and the true partitioning of the variable (w = (1,1, 0, 0)) have been selected. 


5.2 Comparing methods on simulated data 

Data are drawn from a tri-component Gaussian mixture model assuming conditional independence and 
equal proportions. The first r variables are relevant while the last d — r variables are irrelevant since 
they follow standard Gaussian distributions. Under component k , the first r variables follow a spherical 
Gaussian distribution 7V(/x fc ; J) with n 1 = —/x 2 = (e, ...,e) £ R r and /x 3 = 0 r . The three competing 
methods are compared on five different scenarios described in Table 3. 



n 

r 

d 

£ 

Scenario 1 

30 

5 

25 

0.6 

Scenario 2 

30 

5 

25 

1.7 

Scenario 3 

300 

5 

25 

1.7 

Scenario 4 

300 

5 

100 

1.7 

Scenario 5 

300 

50 

500 

1.7 


Table 3 The five scenarios used for the method comparisons. 


For each scenario, 25 samples of size n are generated and the analysis is performed with g = 3. 
Results are presented in Table 4. Note that the RD method is run only on the smaller scenarios for 
computational reasons. 


Scenario 

Method 

NRV 

RRR 

RIR 

ARI 

Time 

i 

MS 

1.28 

0.08 

0.95 

0.01 

0.42 


WT 

8.28 

0.38 

0.68 

0.06 

1.78 


RD-forw 

3.96 

0.15 

0.84 

0.06 

1.15 


RD-back 

11.00 

0.42 

0.55 

0.03 

1.30 

2 

MS 

5.40 

1.00 

0.98 

0.66 

0.47 


WT 

12.88 

0.96 

0.59 

0.59 

1.70 


RD-forw 

3.28 

0.15 

0.87 

0.13 

0.97 


RD-back 

10.64 

0.58 

0.61 

0.37 

49.01 

3 

MS 

5.00 

1.00 

1.00 

0.86 

16.21 


WT 

25.00 

1.00 

0.00 

0.87 

7.97 


RD-forw 

5.52 

0.96 

0.96 

0.82 

6.42 


RD-back 

5.64 

1.00 

0.97 

0.86 

30.65 

4 

MS 

5.00 

1.00 

1.00 

0.88 

73.36 


WT 

100.00 

1.00 

0.00 

0.89 

21.22 


RD-forw 

6.96 

0.92 

0.97 

0.81 

38.05 


MS 

50.04 

1.00 

1.00 

1.00 

48.37 

5 

WT 

500 

1.00 

0.00 

1.00 

83.49 


Table 4 Comparing variable selection methods on simulated data. Means of the numbers of relevant variables (NRV), the 
right relevant rates (RSR), the right irrelevant rates (RIR), the Adjusted Rand Indices (ARI) and the computing times in 
second (Time). 















When the sample size is small and when the component overlap is high (Scenario 1), all methods 
obtain poor results. However, when the component separation increases (Scenario 2), MS method leads 
to a better variable selection (NRV, RRR, RIR) which involves a better partitioning accuracy (ARI). 
Indeed, WT selects some variables which are not discriminative and which damage the partitioning 
accuracy. Both directions of RD lead to only average results. 

When the sample size increases (Scenarios 3, 4 and 5), the MS results for the variable selections 
are ameliorated and the true model is almost always found. In this context, WT claims that all the 
variables are relevant to the clustering. Since the sample size is not too small, its partitioning accuracy 
is not damaged but the model interpretation is strongly harder since all the variables should be used 
for characterizing the classes. In this context, the RD results are good when the number of variables 
is small, but they are damaged when many variables are observed. Moreover, when many variables are 
observed (Scenarios 5), the multiple calls of EM algorithm for the model comparisons prevent the using 
of this method computational reasons. 

During these experiments, the algorithm assessing m* does not suffer from strong problems of local 
optima except for Scenario 1. Indeed, Table 5 presents statistics of the occurrence where the couple 
(z^,, m*) has been found by the algorithm for the 50 random initializations. 



Scenario 1 

Scenario 2 

Scenario 3 

Scenario 4 

Scenario 5 

Mean 

13 

17 

31 

19 

45 

Min 

1 

4 

9 

8 

10 

Max 

35 

31 

45 

28 

48 


Table 5 Mean and minimal occurrence where the couple has been found by the algorithm for the 50 random 

initializations for the 25 generated data. 


6 Numerical experiments on benchmark data 

We now compare the competing methods on real datasets in which the correct number of groups is 
known. Theses data sets are presented in Table 6. 


Name 

d 

n 

S<°> 

Reference 

R package/website 

banknote 

6 

200 

2 

Flury, B. and Riedwyl, H. (1988) 

VarSelLCM 

coffee 

12 

43 

2 

Streuli (1973) 

ppgm 

wine 

13 

178 

3 

Forina and al (1991) 

UCI 

cancer 

30 

569 

2 

Street et al. (1993) 

UCI 

golub 

3051 

83 

2 

Golub and al. (1999) 

multtest 


Table 6 Information about the benchmark data sets. 


Our study is divided in two parts: first the three competing methods are compared by assuming the 
component number known, second the model-based methods are compared by assuming the component 
number unknown. Because RD-back is very slow, we perform the RD method only with the forward 
approach. The experimental conditions are similar to those described in the previous section. Since the 
sparse K-means method does not provide an automatic procedure to select the number of groups, this 
method is not not used in the second part. 


6.1 Component number known 

Table 7 presents the results obtained when the number of components is known. 

The first three data sets are the less challenging since the number of variables is moderate. However, 
MS provides an easier interpretable model than WT (less relevant variables) which provides an identical 
partitioning accuracy. Surprisingly, MS leads to a model having a better BIC value than RD while this 
latter method aims to maximizing this criterion. This phenomenon illustrates the sub-optimality problem 


9 






Data NRV ARI Time BIC 



MS 

WT 

RD 

MS 

WT 

RD 

MS 

WT 

RD 

MS 

RD 

banknote 

5 

6 

4 

0.96 

0.96 

0.98 

2.1 

2.3 

1.6 

-968 

-1009 

coffee 

5 

12 

5 

1.00 

1.00 

1.00 

0.1 

0.9 

0.7 

-522 

-555 

wine 

11 

13 

5 

0.87 

0.85 

0.73 

5.4 

2.4 

1.5 

-3538 

-3769 

cancer 

15 

30 

17 

0.75 

0.70 

0.65 

50 

17 

62 

2189 

2569 

golub 

553 

3051 

10 

0.79 

0.11 

0.00 

34 

54 

258 

-90348 

-95255 


Table 7 Results obtained when the component number is known: number of relevant variables (NRV), adjusted rand index 
computed on the selected model (ARI), computing time in seconds (Time) and BIC criterion value. 


of the RD procedure. Moreover, note that, by selecting only five variables on wine data set, RD provides 
a less relevant partition. 

For the larger data sets, RD obtains better results. Indeed, it is more easily interpretable and it 
provides a more accurate partition. For instance, on cancer data set, MS obtains a better ARI than both 
other methods while it selects less variables. Concerning golub data set, WT relates all the variables 
while RD selects only ten variables. These methods also obtain a worse partition than MS. Finally, note 
that MS obtains a better value of the BIC criterion than RD on golub data set. 

Table 8 shows that the couple is easily accessible by the proposed algorithm. Thus, prob¬ 

lems due to multiple local optima do not occur on these data sets. Moreover, it indicates the values of 
the ICL and MICL criteria related to the model selected by MS. Even if the MICL value is always larger 
(or equal) than the ICL value, this difference is often small. Thus, the partitions z m * and are often 
quite similar. 


banknote coffee wine cancer golub 
Occurrence 48 27 11 48 6 

MICL -1009.2 -644.1 -3715.7 -7963.5 -103858.8 

ICL -1009.2 -644.1 -3715.8 -8064.1 -103858.8 


Table 8 Ocurrence where the couple . m*) has been found by the algorithm for the 50 random initializations and 

values of the information criteria for model m*. 


6.2 Component number unknown 


Table 9 presents the results obtained when the number of components is unknown by setting c/ max = 6. 


Data g NRV ARI BIC 



MS 

RD 

MS 

RD 

MS 

RD 

MS 

RD 

banknote 

3 

5 

6 

4 

0.61 

0.40 

-926 

-978 

coffee 

2 

3 

5 

6 

1.00 

0.38 

-522 

-521 

wine 

4 

4 

11 

6 

0.67 

0.72 

-3502 

-3739 

cancer 

6 

6 

13 

15 

0.21 

0.23 

4192 

4861 

golub 

2 

6 

553 

8 

0.79 

0.00 

-90348 

-95098 


Table 9 Results obtained when the component number is unknown: number of components ( g ), number of relevant 
variables (NRV), adjusted rand index computed on the selected model (ARI) and BIC criterion value. 


On these data sets, RD selects more components than MS. Thus, its interpretation becomes more 
complex even if RD can select less variables. The previous remarks are validated by this application. 
Indeed, we observe than MS can provide a model with a better BIC value. Moreover, when the number of 
variables increases, the results of RD are damaged due to the lack of optimality. This is shown particularly 
by golub data set. 


10 
















7 Discussion 


We have proposed a new information criterion to carry out model selection of a finite mixture model. 
This criterion can be used for selecting the relevant variables for model-based clustering in Gaussian 
mixture settings assuming conditional independence. In such a case, the criterion has a closed form and 
the model maximizing it is accessible by an algorithm of alternated optimization. Its originality consists 
in allowing a model selection procedure which does not require the maximum likelihood estimate. 

The criterion can be easily used when the model at hand is a mixture of distributions belonging 
to an exponential family. Indeed, in such cases, the closed form is preserved. Thus, the MICL criterion 
can carry out variable selection in a cluster analysis of categorical or mixed datasets by using the 
model of Celeux and Govaert (1991) and of Moustaki, I. and Papageorgiou, I. (2005) respectively. The 
application of the proposed method to the categorical data appears to be especially pertinent since 
conditional independence assumption is often assumed for such data, and since Jeffreys non informative 
prior distributions are available. 

If the conditional independence assumption is relaxed, the algorithm used for model selection should 
be modified. Then, its model step is not explicit but it can be achieved by a MCMC method. 

We have compared our method with two standard procedures of variable selection in cluster analysis. 
It was shown that the proposed method outperforms both other ones for the task of variable selection. 
It results in a better partitioning accuracy. In a moderate computing time, the proposed method can 
manage datasets with a large number of variables and a relatively large number of individuals. However, 
the procedure of model selection is time-consuming if a huge number of individuals is observed. In such a 
case, the optimization of the model selection procedure is an issue which calls for further improvements. 

Finally, this approach could be extend to perform a more elaborated variable selection. Indeed, by 
using the approach of Maugis, C. and Celeux, G. and Martin-Magniette, M. (2009). 

The R package VarSelLCM implementing the proposed method is downloadable on CRAN. 
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Appendices 

A Consistency of the MICL criterion 

This section is devoted to the proof of consistency of our MICL criterion with a fixed number of com¬ 
ponents. The first part deals with non-nested models and requires a biais- entropy compensation as¬ 
sumption. The second part covers the nested models, i.e, when the competing model contains the true 
model. In what follows, we consider the true model = (</°\ , its set of relevant variables is 

12^ = |j : w|°’ > = 1 j and the parameter is 0 (n ^. 

Case of non-nested model We need to introduce the entropy notation given by 

n g 

£(0; z, m) =^2^2 Zik ln T ik{0 \m), 


i= 1 k=l 


Kkffjxi | 6 k, m) 


where r ik (6 \ m) = , v 

lJ h 'Kh<P\Xi | 9 h ,m) 

Proposition 1 Assume that m,Al is a model such that m/°) is a non-nested within Assume that 


-E 


„( 0 > 


eCi ** n-=i 0(*ii i /4?» a kj 2 ) 1 gl° ) (*i) 


In 


< KL 


mWllmW 


(19) 


where KL 


I Im/ 1 ) 


p(x i | 0 <O) ,ml 0 )) 

is the Kullback-Leibler divergence of p(- | from p(- | and 


G [, 0) = < x £ ! d : k = argmax n h TT \ cr^ 0) ) > . 

' l<h<gW ** 1 


When n —> oo, we have 


P ^MICL(m ( - 1) ) > MICL(m w )j —> 0. 


12 










Proof For any model m , we have the following inequalities, 


ICL(m) < MICL(m) < lnp(x | m). 


MICL(m (1) ) - MICL(m (0) ) > 0 l < P\ lnp(x | m (1) ) - ICL(m (0) ) > 0 l. 


It follows, 


Now set Au = z/ 1 ) — where I'W and are the numbers of free parameters in the models 
and m^ 0 ' 1 respectively. Using Laplace’s approximation, we have 

(o) 


ICL(m^) = lnp ^x | O' ^ 6 1 \-— Inn + O p ( 1), 


where 6 * and z^ 0 ) are respectively the MLE and the partition given by the corresponding MAP rule. 
In the same way, we have 

lnp(x | m/ 1 -*) = lnp(x | 6^ \ m 1 ' 1 ' 1 ) -— Inn + O p { 1), 


dP . 


where 0 is the MLE of d^K Note that 


lnp(x | — ICL(m^) = —A + nB n -— Inn + O p { 1), 


where 


A„ = 2 In 


p ^x | 0* \ m.w'j p (x\ 0^ \ 

' ' -2In- ^ 


and 


B n = 


(x | 6^\ mX 1 )^ p ^x | m(°)^ 

i p (x | 0 (1) , m/ 1 )) i / f0 N \ 

I In - V- l - ( 0 (O) ; z(°), m<°> ) . 

n p fx | 8^°\ m(°) 1 n V / 


When n —>• oo, we have A n —>• Xav i n distribution and B n tends to 


-KL 




-E 


„«» 


eCi ** uU i > ct £ )2 ) 1 g'° ) (*0 


In 


p(a;i | m(°)) 

in probability. Thus, under the assumption (19), MICL is consistent since when n —> oo, we have 


p|MICL(m (1) ) - MICL(ra (0) ) > 0 j < P 


A n + O p ( 1) > Av In n 

0 . 


+ 1 


B n > 0 


Case of nested model Recall that MICL(m(°)) = lnp(x, z 1 ' 0 ' 1 | m/ 0 )), where = argmaxlnp(> 

Z 

m (°)). We have 

z (0) = argmaxj lnp (z | g (0) ) + ^ lnp(x.j- | wj 0) , .g (0) , z) |, 


jen o 


where h?o = |j : = l|. Let ra^ = (g(°), h?i) where b?i = 17 oUI2 0 i and l?oi = | j ■ = 1, = o|. 

Then, in the same way, we have MICL(m^ 1 ^) = lnp(x, z^ | m^), where 


= argmax 


hi p (z | g (0) ) + ^2 lnp (x..j | z) 


iefii 
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Let j € i?oij Laplace’s approximation gives us, 


In V (x.j | 




wj- ',g x ',z 


ny 

^2 ^2 Zik ln ^ i ^ ’ 4j )2 j - # (0) ln ^+ °p (!) > 


i =1 /c—1 


where 


n 

(/4V, Pk) 2 ) = argmax V z ik ln cf> (. 
\ J (i) ( 1)2 —7 \ 

»=* 


x - I U (1) cr (1)2 
I r^kj J U kj 


)• 


Proposition 2 Assume that m W *s a model such that = g and fl\ = l?o U L?oi where / 2 oi 7^ 0 , 
Le, t/ie model m/°) is nested within the model m W wit/i </ie same number of components. When n — » 00, 


p(M/CL(m (1) ) > MICL(m {0) ) ) —> 0. 


Proof We have 


MICL(m (1) ) > MICL(m (0) ) < P ^ E hl ^ ^ ^ ^ > 0 


L jefio 


! p(x.j I wj 0 ) ,g(°),z( 0 )) 


And for each j G i?oi 5 when n —>• oo 


n 5 


( 0 ) 


^EE^ln 


2=1 k—1 


'(*« 1 / 4 VEV 2 ) 

*« I d?E? 2 ) 


■( 


X 2 g (o) in distribution. 


We have 


.. 

—1 0 by Chebyshev’s inequality. 


') 
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